Link-Based Clustering for Finding Subrelevant Web Pages

Authors

  • Tomonari Masada
  • Atsuhiro Takasu
  • Jun Adachi
Abstract

We propose a new Web page clustering method. Typical search engines return only relevant pages, i.e., pages that match users' needs. In contrast, we design our clustering method to also return non-relevant pages as search results when they refer to relevant pages and help users anticipate the contents of those relevant pages. We call such pages subrelevant. Because it is difficult to improve Web search performance, we use subrelevancy to relax the criterion for which pages should appear in search results while incurring the least drawback: a subrelevant page is one click away from a relevant page. Our clustering method is based on three concepts: THP, out-degree path length, and a threshold parameter. We use the clustering results to modify the feature vectors of Web pages, so each clustering result induces a reranking of the search results. We expect the reranking to raise the ranks of subrelevant pages. In experiments with the NTCIR-3 Web task test collection, our clustering improved average precision by 13 percent over the baseline.
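The abstract does not say how the clustering results modify the feature vectors, so the following is only a minimal sketch under stated assumptions: each page vector is blended with the centroid of its cluster, and pages are then reranked by cosine similarity to the query. The blending weight `alpha`, the function names, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_with_clusters(vectors, cluster_of, alpha=0.5):
    """Blend each page vector with its cluster centroid.

    ASSUMPTION: linear blending with weight `alpha` is an illustrative
    choice, not the modification rule used in the paper.
    """
    members = defaultdict(list)
    for page, cluster_id in cluster_of.items():
        members[cluster_id].append(page)

    smoothed = {}
    for pages in members.values():
        dim = len(vectors[pages[0]])
        centroid = [sum(vectors[p][i] for p in pages) / len(pages) for i in range(dim)]
        for p in pages:
            smoothed[p] = [(1 - alpha) * x + alpha * c for x, c in zip(vectors[p], centroid)]
    return smoothed


def rerank(query, vectors):
    """Return page ids sorted by decreasing cosine similarity to the query."""
    return sorted(vectors, key=lambda p: cosine(query, vectors[p]), reverse=True)


# Toy data (hypothetical): "hub" shares no terms with the query but is
# clustered with "relevant", standing in for a subrelevant page.
vectors = {
    "relevant": [1.0, 0.0],
    "unrelated": [0.0, 1.0],
    "hub": [0.0, 1.0],
}
cluster_of = {"relevant": 0, "hub": 0, "unrelated": 1}
query = [1.0, 0.0]

baseline = rerank(query, vectors)
smoothed = smooth_with_clusters(vectors, cluster_of)
reranked = rerank(query, smoothed)
```

In this toy setup the page clustered with the relevant page gains similarity to the query after smoothing, while the page in a separate cluster does not, which is the kind of rank change the abstract expects for subrelevant pages.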


Similar resources

Finding Community Base on Web Graph Clustering

Search Pointers organize the main part of the application on the Internet. However, because of information management hardware, the high volume of data, and word similarities across different fields, most answers to users' questions are not correct. Web graph clustering and the placement of clusters in the corresponding answers therefore help users reach their intended results. Community (web communit...


Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems

One of the most famous methods for recommendation is user-based Collaborative Filtering (CF). This system compares the active user's item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...


An Overview of Web Data Clustering Practices

Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can be performed either once offline (independently of search queries) or online (on the results of search queries). Imp...


Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...


Submitted in partial fulfillment of the requirements for the degree of Masters of Arts 2006

We present a highly accurate method for classifying web pages based on link percentage, which is the number of text characters that are parts of links divided by the number of all text characters on a web page. K-means clustering is used to create unique thresholds to differentiate index pages and article pages on individual web sites. Index pages contain mostly links to articles and ot...




Publication date: 2005